Device for managing voice data.



The present invention concerns a device for managing voice
data. The embodiment described comprises means (20) for
displaying a visual representation of a voice message and
means for associating markers (42,44,46,48) with segments of
the message. The markers (42,44,46,48) are indicative of
particular storage areas, eg a telephone number storage area,
a calendar storage area etc. Association of a marker
(42,44,46,48) with a segment of a voice message automatically
causes that segment to be linked with the corresponding
storage area so that the segment can later be retrieved in the
context of a user interface for that particular storage area.
Technical Field

The present invention relates to a device designed to
facilitate the management of voice data. Voice messages, left
on a recipient's answerphone or delivered via a voicemail
system, are a popular form of person-to-person communication.
Such voice messages are quick to generate for the sender but
are relatively difficult to review for the recipient; speech
is slow to listen to and, unlike inherently visual forms of
messages such as electronic mail or handwritten notes, cannot
be quickly scanned for the relevant information. The present
invention aims to make it easier for users to extract relevant
information from voice messages, and other kinds of voice
record, such as recordings of meetings and recorded dictation.

In the long-term it would be desirable to approach this
problem by automatically translating speech into text using
speech recognition. Unfortunately this approach is not yet
practical, since current speech recognition technology cannot
accurately transcribe naturally-occurring speech of the kind
found in voice messages. Therefore a number of approaches have
been developed which help users to review voice data without
actually recognising the speech signal and which provide for
the display, structuring and annotation of speech recordings.


Background Art 

Many approaches assume, but do not necessarily depend on, an
underlying technique for displaying a visual representation of
speech. One such form of display is a single graphical line,
graduated with time markings from start to finish (for
example, a 4 second message may contain the appropriately
spaced labels "0 sec", "1 sec", "2 sec", "3 sec", "4 sec"). In
addition, an algorithm can be used to process the speech
record to distinguish the major portions of speech from the
major portions of silence. Such an algorithm is described by
Arons (1994, Chapter 4). This permits a richer form of
graphical display, in which the speech record is still
portrayed along a timeline, but with portions of speech
displayed as dark segments (for example) and the detected
portions of silence displayed as light segments. Four pieces
of prior art will be referred to:
1. A paper in the proceedings of CHI '92 entitled
"Working with Audio: Integrating Personal Tape
Recorders and Desktop Computers" by Degen,
Mander and Salomon (1992) describes a prototype
hand-held personal tape recorder. This is
similar to a conventional "dictaphone" except
that the user can place index points on the
recording by pressing a button at the appropriate
point in the recording. Two index buttons
are available and these have no predetermined
meaning. The user is free to place their own
interpretation on the two forms of index. The
recording can be downloaded to a personal
computer and the inserted index points can be
displayed along the timeline of the message. By
visually displaying the index points, the user is
reminded of an area of interest in the speech
recording and can selectively play back portions
of speech by using a pointing device such as a
mouse. In addition, the index points can be
searched for within the recording.

2. The NoteTaker product from InkWare Development
Corp. (1994) extends this idea in the
context of computer-based handwritten notes,
rather than speech. Here users can select one
of a variety of visual labels, representing for
example "Urgent!", "Call" or "Action", and
associate these with selected parts of a
handwritten note. The program then allows the user
to find all notes containing a particular label, an
"Action" item for example.

3. Ades and Swinehart (1986) have built a proto- 
type system for annotating and editing speech 
records. This system is the subject of their 
paper entitled "Voice Annotation and Editing in 
a Workstation Environment" from Xerox Cor- 
poration. In particular, an arbitrary text annota- 
tion can be placed on a visually displayed seg- 
ment of speech as a cue to the content of that 
portion of speech. 

4. A paper entitled "Capturing, Structuring and
Representing Ubiquitous Audio" by Hindus,
Schmandt and Horner (ACM Transactions on
Information Systems, Vol 11, No. 4, October
1993, pages 376-400) describes a prototype
system for handling speech which allows the 
user to select a portion of visually displayed 
speech and to associate the depicted speech 
portion (such as by "drag-and-drop" using a 
mouse) with another application, such as a cal- 
endar. The calendar may contain independently 
entered, standard textual data (such as "Meeting 
with Jim"), as well as audio annotations and 
additions associated in this way. 

Referring to the prior art items numbered 1-4 
above, approaches (1) - (3) offer annotations which 
the user can employ as a visual cue to relevant 
parts of the speech (or handwriting, in the case of 
(2)). In (1), two labels are available with no 
predefined meaning. In (2), the user can choose 
from a broader set of labels, the appearance of 
which suggests a particular use (eg. the user 
should use the "Call" label for tagging items about 
telephoning people). In (3), the user can tag 
speech with an arbitrary textual entry, thus providing
an even richer form of annotation. However, in
all these approaches the label plays only a passive
role in organising the target data. It is a passive 
visual and searchable cue to parts of the speech, 
and does not help the broader integration of the 
speech with other relevant applications in the 
user's personal information environment. 

Approach (4) addresses this problem by allowing
users to associate selected speech clips with,
for example, a text-based calendar. A disadvantage
of this approach is that it is rather laborious - the 
user must identify the appropriate speech clip, 
select it, and then associate it with another applica- 
tion. In addition, not all user interfaces lend them- 
selves to this approach. 

Disclosure of Invention 

According to the present invention we provide 
a device for storing speech input comprising: 
means for specifying a marker having a particular 
connotation; 

means for associating the marker with all or part of 
the speech input; 

and means for automatically linking the speech 
input associated with the marker to a correspond- 
ing storage area for later retrieval by the user in the 
context of a user interface which is dependent on 
the connotation of the associated marker. 

A device according to the present invention 
has the advantage of providing a simple and con- 
venient way of integrating voice data with other 
user applications so as to facilitate the manage- 
ment of voice data. In the embodiment to be de- 
scribed, the corresponding storage areas include 
telephone book and calendar application storage 
areas. 

Preferably, the means for specifying a marker 
comprises means for selecting a marker from a set 
of markers. The set of markers preferably comprises
iconic representations of the corresponding storage 
areas. 

In the embodiment to be described there are 
means for displaying a representation of the 
speech input. This allows a user to view a visual 
representation of voice data on a desktop computer 
display. In that embodiment, there are means for 
automatically segmenting the speech input, specifi- 
cally for automatically segmenting the speech input 
into silent and non-silent parts. 

The marker may be associated with a part of the speech input
by time synchronisation. This approach conflates the selection
of a marker and its association with a segment of speech data
in a manner which may be particularly convenient for users.
Alternatively, the marker may be associated with a part of the
speech input by user input. The user input may comprise
manipulation of an input device, eg. dragging and dropping a
marker icon on the relevant speech segment using a mouse.
Alternatively, the user input may comprise means for
associating a marker with a part of the speech input by spoken
commands.

The linking means may comprise means for copying the speech
input associated with the marker to the corresponding storage
area. Alternatively, the linking means may comprise means for
moving the speech input associated with the marker to the
corresponding storage area. Another possibility is for the
linking means to comprise means for providing a pointer to the
speech input associated with the marker in the corresponding
storage area. It may also be useful for the linking means to
comprise means for providing an index into the original voice
data containing the speech input associated with the marker.

Brief Description of Drawings 

Particular embodiments of the present invention will now be
described, by way of example, with reference to the
accompanying drawings of which:

Figure 1 depicts the user interface of a device according to a
first embodiment of the present invention;

Figure 2 depicts the user interface of Figure 1 after
labelling of two speech segments;

Figure 3 depicts the user interface of a known telephone book
application.

Best Mode for Carrying Out the Invention & Industrial Applicability

The present invention can be implemented in the context of a
"Personal Message Manager" application for browsing voice
messages.

The embodiment to be described with reference to Figures 1 to
3 is written in Microsoft Visual Basic and Borland C on an
IBM-compatible 486 25MHz Personal Computer, and runs under the
Microsoft Windows 3.1 operating system. Audio recording and
playback facilities are supported by a SoundBlaster 16 ASP
card (Creative Labs, Inc.). These facilities are accessed
through the standard MS Windows MultiMedia Application
Programmers' Interface. Speech records are created using a
microphone connected to the audio card, and played back via a
set of speakers also connected to the card. On recording, the
audio card translates the analogue audio signal produced by
the microphone into a standard digital representation of the
recorded speech, and stores the data in the standard ".wav"
file format. The card performs the converse digital-to-analogue
conversion in order to play back a digital ".wav" file through
loudspeakers.

User input is by means of a mouse.
Figure 1 shows an interaction screen 10 in a Microsoft Windows
user interface. A set of folders represented by icons 12 are
for storing previous voice messages. One of the folders 14 has
been selected, which causes the "header" information for each
message in the selected folder to be displayed in a display
box 16. The display box 16 displays the date of receipt and
the sender of each message. Figure 1 shows the topmost message
18 having been selected. This causes the selected message 18
to be displayed as a series of blocks in another display box
20. In the display box 20, dark blocks represent speech and
white blocks represent silence. A known speech processing
algorithm is utilised to distinguish between the major
segments of speech and silence; such an algorithm is described
in the paper by Arons (1994, Chapter 4).
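
The speech/silence display can be illustrated with a minimal sketch. It is not the Arons algorithm cited above; it assumes a much simpler fixed energy threshold applied to fixed-length frames, and the names `segment_speech`, `frame_ms` and `threshold` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (kind, start_ms, end_ms), end exclusive


def segment_speech(samples: List[int], sample_rate: int,
                   frame_ms: int = 10, threshold: float = 500.0) -> List[Segment]:
    """Split mono PCM samples into alternating 'speech'/'silence' runs.

    Assumption: a frame counts as speech when its mean absolute
    amplitude exceeds a fixed threshold. A practical detector would
    adapt the threshold and ignore very short pauses.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    segments: List[Segment] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(abs(s) for s in frame) / len(frame)
        kind = "speech" if energy > threshold else "silence"
        start_ms = start * 1000 // sample_rate
        end_ms = min(len(samples), start + frame_len) * 1000 // sample_rate
        if segments and segments[-1][0] == kind:
            # Same kind as the previous frame: extend the current run.
            segments[-1] = (kind, segments[-1][1], end_ms)
        else:
            segments.append((kind, start_ms, end_ms))
    return segments
```

Each returned run could then be drawn as a dark block (speech) or a white block (silence) along the timeline of the display box 20.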

Above the display box 20 is a set of audio 
controls 22 to allow the user to play, pause and 
stop speech playback. The audio controls 22 com- 
prise the following button representations: 
a play button 24;
a pause button 26;
a stop button 28;
a previous button 30 to skip playback to the previous segment of speech;
a next button 32 to skip playback to the next segment of speech;
a repeat button 34 to repeat playback of the most recently played segment of speech;
a speed control button 36 to vary the playback speed.

The user can also click directly on a segment
of speech in the display box 20, eg using a mouse,
to play back that specific segment. In Figure 1, an
arrow-shaped cursor 38 is shown in the display box 
20 to indicate that playback is ready to commence 
at the beginning of the speech file. As a speech 
segment is being played, its colour changes to give 
the user a cue to the current position in the speech 
record. 

To the right of the display box 20 is a panel 40 of markers
42, 44, 46 and 48 for labelling portions of the recorded
speech. These can be used to provide a visual cue to the
contents of a message. There are markers corresponding to a
Phone Book 42, a Time/Appointment diary 44, a Memo/Reminder
list 46, and a miscellaneous Points of Interest area 48. For
example, the message 18 may contain a speech segment such as
"If you need to get back to me, my number is 228 455." This
segment could be labelled with the Phone marker 42. Whenever a
marker is placed on a speech segment in the display box 20,
that segment of speech is automatically linked to a
corresponding application in the user's computer system. This
automatic linking of speech segments to other applications
using visual markers is convenient for the user and is an
important step towards integrating the various applications
relevant to handling voice data.

Figure 2 depicts a situation in which the user has labelled
two segments of speech, 50 and 52, the segment 50 as a Memo,
and the segment 52 as a Phone item. This is accomplished by
clicking the appropriate marker during playback of the
relevant speech segment; the system then associates an
instance of this marker with the segment of speech being
played and provides a visual representation of the marker
above the segment in the display box 20 as shown.

As well as providing a visual cue to the content of the speech
record, placing markers against speech segments in the display
box 20 automatically links the labelled segments to an
appropriate computer application. For example, marking the
message with the Phone label 42 as shown in Figure 2 causes
the marked segment of speech to be automatically added to a
standard, textual Phone Book application, depicted in Figure
3. The 'Phone Book' window comprises a display box 54 listing
the entries in the directory and two buttons, an 'Add' button
56 and a 'Delete' button 58, for use when adding and deleting
entries in the list.

Items in the display box 54 which have voice data associated
with them are indicated explicitly, eg item 60 in Figure 3.
Selecting such an item in the display box 54 causes the
appropriate speech clip to be played back.

An advantage of the approach described above is that it
provides a very quick and easy method of capturing and storing
information, whilst it is listened to in spoken form. Later,
at a time more convenient to the user, he/she can transcribe
this portion of speech into a full textual phone book entry if
desired.

In order to associate the selected marker with a specific
segment of speech, it is necessary to determine the segment of
speech that is currently being played. There are a number of
ways in which this can be implemented and one method is
described here. Assume the algorithm used for speech/silence
detection (such as Arons, 1994) has produced a data file
indicating the times in the speech file of speech and silence.
For example:

Speech (1): 0 milliseconds (ms) to 800ms
Silence: 801ms to 1050ms
Speech (2): 1051ms to 3405ms
Silence: 3406ms to 3920ms
Speech (3): 3921ms to 6246ms

Suppose the speech message is played back from the start of
the message. At the start of the playback, an internal clock
is set to 0ms to track the time. If the user selects (ie.
clicks) a marker, the time is noted, say 5324ms, and then the
speech/silence data file, illustrated above, is searched to
see to which segment this time corresponds. In the above
example, 5324ms falls between 3921ms and 6246ms, which implies
that the system is currently playing the third speech segment.
In this way, time-synchronisation is used to associate a
marker with a speech segment.
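
The lookup described above can be sketched as follows. This is an illustration only, assuming the segment times are held in a sorted in-memory list rather than a data file; the names `SEGMENTS`, `segment_at` and `speech_number` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (kind, start_ms, end_ms) with inclusive bounds, as in the example above.
SEGMENTS: List[Tuple[str, int, int]] = [
    ("speech", 0, 800),
    ("silence", 801, 1050),
    ("speech", 1051, 3405),
    ("silence", 3406, 3920),
    ("speech", 3921, 6246),
]


def segment_at(time_ms: int, segments=SEGMENTS) -> Optional[int]:
    """Return the index of the segment being played at time_ms, or None."""
    starts = [start for _, start, _ in segments]
    i = bisect_right(starts, time_ms) - 1  # last segment starting at or before time_ms
    if i >= 0 and time_ms <= segments[i][2]:
        return i
    return None


def speech_number(index: int, segments=SEGMENTS) -> Optional[int]:
    """Return the 1-based position among speech segments, or None for silence."""
    if segments[index][0] != "speech":
        return None
    return sum(1 for kind, _, _ in segments[:index + 1] if kind == "speech")
```

With the clock reading 5324ms, `segment_at(5324)` selects the last entry in the list, and `speech_number` reports it as speech segment 3. The text does not say what should happen when the marker is clicked during silence; this sketch simply reports that no speech segment is playing.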

In order automatically to link to another ap- 
plication and subsequently to play a speech clip 
from that application, a visual indication of the 
speech within that application is provided and the 
relevant application must be able to play back the 
speech clip directly. This is accomplished using 
standard MS Windows programming techniques. In 
the Phone Book example, an automatically gen- 
erated textual entry is added to the Phone Book 
display (for example, see item 60 in Figure 3). In 
addition, in the underlying data structure, this entry 
is flagged as being voice data and a simple speci- 
fication of where to find the appropriate voice data 
is recorded. This specification comprises a pointer 
to the original ".wav" speech file, along with a 
specification of start and end points within this file 
that represent the speech segment to be accessed. 
These points can be specified as times, byte posi- 
tions, or other representations. When selected, the 
audio Application Programmers' Interface is used 
to play back this segment of speech from within 
the Phone Book application. 
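
The underlying data structure might look like the following sketch. The class and field names (`VoiceLink`, `PhoneBookEntry`, `wav_path` and so on) are hypothetical, and the byte offsets assume uncompressed PCM audio.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VoiceLink:
    """Where to find a speech segment inside an existing recording."""
    wav_path: str  # pointer to the original ".wav" speech file
    start_ms: int  # start point of the segment within the file
    end_ms: int    # end point of the segment within the file

    def byte_range(self, sample_rate: int, sample_width: int,
                   channels: int = 1) -> Tuple[int, int]:
        """Express the start and end points as byte positions.

        Offsets are relative to the start of the sample data,
        not to the start of the file header.
        """
        frame_bytes = sample_width * channels
        start = self.start_ms * sample_rate // 1000 * frame_bytes
        end = self.end_ms * sample_rate // 1000 * frame_bytes
        return start, end


@dataclass
class PhoneBookEntry:
    label: str                         # automatically generated text shown in the list
    voice: Optional[VoiceLink] = None  # present when the entry is flagged as voice data

    @property
    def is_voice(self) -> bool:
        return self.voice is not None
```

An entry for the third speech segment of the earlier example could be created as `PhoneBookEntry("Voice item", VoiceLink("message.wav", 3921, 6246))`, where the label and file name are placeholders.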

The embodiment described above is a voice 
data management device which is easy to use and 
which integrates voice data into other user applica- 
tions in a convenient manner. Many of the features 
described with reference to this embodiment can 
be modified and categories of these will now be 
addressed. 

1. Selection of speech marker

Apart from a mouse, other possible selection 
devices include a pen/stylus, a touch-screen and 
the use of the TAB key on a keyboard for iterative- 
ly cycling through menu selection options dis- 
played to the user. Alternatively, each marker could 
be represented by a dedicated hard button on a 
device implementing the present invention and 
pressed during playback of recorded speech. 

2. Association of markers with speech 

In the embodiment described above, the timing 
of the marker selection governs the speech seg- 
ment with which it is to be associated. An alter- 
native is to allow the user actively to associate a 
marker with the speech segment of interest, eg by
"drag-and-drop". This approach is particularly useful
after the message has been listened to at least
once, when the user is undertaking considered
analysis and structuring of the speech file.

An alternative set of approaches conflates the selection and
association steps. The user may select the speech segment of
interest, either by explicit selection with a mouse, or
implicit selection by time synchronisation, and linguistically
specify the marker to be associated with that segment. The
linguistic specification could be made by typing in some
initial identifying characters of the name of the marker (eg.
"ph" for Phone), by drawing or hand-writing the name of the
marker and using handwriting recognition to determine the
intended marker, or by speaking the name of the marker and
using speech recognition to identify it.
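
The typed form of linguistic specification can be sketched as a prefix match. The marker names follow the embodiment; the function name `marker_from_prefix` and the rule for ambiguous input are assumptions.

```python
from typing import List, Optional

MARKERS: List[str] = ["Phone", "Time/Appointment", "Memo/Reminder", "Points of Interest"]


def marker_from_prefix(typed: str, markers: List[str] = MARKERS) -> Optional[str]:
    """Return the marker whose name starts with the typed characters.

    Matching ignores case. An empty or ambiguous prefix returns None
    so that the caller can wait for more characters.
    """
    prefix = typed.strip().lower()
    if not prefix:
        return None
    matches = [m for m in markers if m.lower().startswith(prefix)]
    return matches[0] if len(matches) == 1 else None
```

Here "ph" identifies the Phone marker, whereas "p" alone is ambiguous between Phone and Points of Interest and so selects nothing.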

A final general approach to marker association is
automatically to identify the appropriate marker for a segment
of speech by partially recognising the speech itself. Here
techniques for "word-spotting" in continuous speech, for
example based on Hidden Markov Models (cf. Wilcox and Bush,
1991), could determine the likelihood that a certain speech
segment contains a telephone number. If the recognition
algorithm predicts a high probability of a phone number, the
segment could be labelled automatically with the Phone marker.

3. Definition of markers 

It is anticipated that the user may be able to customise the
markers and corresponding storage areas available within a
system according to the present invention. A suite of icons
could be made available from which the user can choose. In
addition, the user could define arbitrary text labels and
place these in the panel of markers. The system could also
allow the user to specify the storage area associated with
each marker.

4. Accessing speech segments from target application

The above description assumes that the storage application
(eg. Phone Book) is provided with a link to the original
speech file. There are various ways in which this could be
implemented:

i) Copy - a copy of the appropriate speech data could be made
and stored in a separate file;

ii) Move - a copy of the appropriate speech data could be made
and stored in a separate file, and the segment could be
removed from the original voice record (ie. from the voice
message);

iii) Link - as in the above-described embodiment, a pointer to
the same speech file can be provided.
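
As an illustration of the Copy option only, the sketch below uses Python's standard `wave` module to write the frames of one segment to a separate destination. It assumes uncompressed PCM ".wav" data and non-negative times; the function name `copy_clip` is hypothetical. A Move would additionally rewrite the original record without those frames, and a Link needs only the file pointer and the start and end points.

```python
import io
import wave


def copy_clip(src, dst, start_ms: int, end_ms: int) -> int:
    """Copy the frames between start_ms and end_ms from src to dst.

    src and dst may be file names or binary file objects. Returns the
    number of frames actually copied, which is smaller than requested
    if the span runs past the end of the recording.
    """
    with wave.open(src, "rb") as reader:
        rate = reader.getframerate()
        first = min(start_ms * rate // 1000, reader.getnframes())
        count = max(0, end_ms * rate // 1000 - first)
        reader.setpos(first)
        frames = reader.readframes(count)
        frame_bytes = reader.getsampwidth() * reader.getnchannels()
        with wave.open(dst, "wb") as writer:
            # Same channels, sample width and rate as the original;
            # the frame count in the header is corrected on close.
            writer.setparams(reader.getparams())
            writer.writeframes(frames)
    return len(frames) // frame_bytes
```

Passing `io.BytesIO()` objects instead of file names keeps the clip in memory, which is convenient for trying the function without touching the disk.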

Another approach is to treat the copied/linked speech clip as
representing an index into the original message. In this case,
when the clip is played back from the application (eg. the
Phone Book), the user has the option of reviewing the entire
message from which it was extracted. This is a useful
enhancement since an automatic segmentation algorithm will
sometimes produce inappropriate segmentations, for example
breaking a telephone number in the middle, in which case it is
important for the user to be able to continue playback (or
rewind) after the linked speech clip has been played.

5. Extracting segments from the original speech record

In the above-described embodiment, the 
speech record is segmented into speech and si- 
lence using an algorithm such as Arons (1994, 
Chapter 4). Alternatively, the original speech record 
could be represented to the user as a continuous,
unstructured line. Markers could be associated with 
this line using the same range of techniques de- 
scribed above and the only difference would be 
that the marker is associated with a point in the 
speech record rather than a segment of speech. 

Automatically storing the speech associated 
with a marker could then be accomplished by 
either (a) arbitrarily defining the segment of interest 
eg. a 5 second clip centred on the marker point, or 
(b) assuming the indexing approach outlined in 
point (4) above, where the storage of the speech in 
the target application is merely a point at which to 
index into the original. 
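
Option (a) can be sketched as follows. The 5 second default comes from the example above; shifting the window (rather than shortening it) near the ends of the recording is an assumption, and the name `clip_around` is hypothetical.

```python
from typing import Tuple


def clip_around(marker_ms: int, total_ms: int, width_ms: int = 5000) -> Tuple[int, int]:
    """Return (start_ms, end_ms) of a clip centred on marker_ms.

    When the window would run past either end of the recording it is
    shifted inwards, so the clip keeps its full width where possible.
    """
    width_ms = min(width_ms, total_ms)  # recording shorter than the window
    start = marker_ms - width_ms // 2
    start = max(0, min(start, total_ms - width_ms))
    return start, start + width_ms
```

For a marker well inside a long recording the clip is symmetric about the marker point; near the start or end it slides so that it stays within the recording.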

6. User interface designs 

Whenever a marker is associated with a seg- 
ment of speech, an instance of that marker could 
appear in the "header" line for the message (along 
with date, sender, etc). This would provide a cue to 
the user that the message contains eg. a phone 
number. A possible additional feature would be to 
play back every segment in the relevant message 
which has been associated with this type of marker 
on selection of the header marker by the user eg 
by clicking with the mouse. 

Moreover, a "find" facility could be included 
with the Personal Message Manager which could 
find all messages containing a certain type of 
marker, or combination of markers.
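
A "find" facility of this kind could be sketched as below, assuming each message records the set of marker types placed on it. The `Message` class and `find_messages` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Message:
    sender: str
    markers: Set[str] = field(default_factory=set)  # marker types placed on this message


def find_messages(messages: Iterable[Message], wanted: Iterable[str]) -> List[Message]:
    """Return the messages that carry every marker type in wanted."""
    required = set(wanted)
    return [m for m in messages if required <= m.markers]
```

Searching for a single type, eg. `find_messages(inbox, ["Phone"])`, returns every message with at least one Phone marker, while passing several types returns only messages carrying that combination.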

7. Device without a display 

The present invention also has application in a 
device which lacks a display. Such a device may 
be useful for visually impaired people, for whom 
speech-based information is more useful than vi- 
sual information. Speech messages could be re- 
viewed using a set of hard buttons, similar to those 
used in dictaphones for example, and interesting 
portions of speech could be labelled using a set of 
hard marker buttons (as described in (1) above).
Such portions could then be linked as described 
above to speech-based storage areas, such as a 
speech-based phone book. 

The present invention is relevant to a range of uses of speech
data. It may have particular utility for users who receive a
large amount of voice mail containing similar kinds of
information. This information may not need to be transcribed
immediately, but it may help to store the spoken information
in a structured form. For example, field staff may telephone a
central office to report the time of a repair, the problem
diagnosed and the work undertaken. This information could be
extracted from the voice messages and categorised using the
techniques described.

The invention has been described in terms of a program for
handling voice messages. However, the invention is applicable
to all forms of recorded speech, and the implementation
described need not necessarily be part of a telecommunications
system. Other possible uses include the management of voice
data comprising recordings of meetings, general conversations
and other personal data.

Claims 

1. A device for storing speech input comprising:
means for specifying a marker having a particular
connotation;
means for associating the marker with all or
part of the speech input;
and means for automatically linking the speech
input associated with the marker to a corresponding
storage area for later retrieval by
the user in the context of a user interface
which is dependent on the connotation of the
associated marker.


2. A device according to claim 1 wherein the 
means for specifying a marker comprises 
means for selecting a marker from a set of 
markers. 


3. A device according to claim 2 wherein the set 
of markers comprises iconic representations of 
the corresponding storage areas. 

4. A device according to any preceding claim
comprising means for displaying a representa- 
tion of the speech input. 

5. A device according to claim 4 comprising
means for automatically segmenting the
speech input.

6. A device according to claim 5 comprising
means for automatically segmenting the 
speech input into silent and non-silent parts. 

7. A device according to any preceding claim
comprising means for associating a marker
with a part of the speech input by time
synchronisation.

8. A device according to any of claims 1 to 6
comprising means for associating a marker
with a part of the speech input by user input.

9. A device according to claim 8 comprising
means for associating a marker with a part of
the speech input by manipulation of an input
device.

10. A device according to claim 9 comprising
means for associating a marker with a part of
the speech input by spoken commands.

11. A device according to any preceding claim
wherein the linking means comprises means
for copying the speech input associated with
the marker to the corresponding storage area.

12. A device according to any one of claims 1 to
10 wherein the linking means comprises
means for moving the speech input associated
with the marker to the corresponding storage
area.

13. A device according to any one of claims 1 to
10 wherein the linking means comprises
means for providing a pointer to the speech
input associated with the marker in the
corresponding storage area.

14. A device according to any one of claims 1 to
10 wherein the linking means comprises
means for providing an index into the original
voice data containing the speech input
associated with the marker.
